ABSTRACT

Coexpressed gene databases are valuable resources for identifying
new gene functions or functional modules in metabolic and signaling
pathways. Although coexpressed gene databases are a fundamental
platform in the field of plant biology, their use in animal studies
is relatively limited. COXPRESdb (http://coxpresdb.jp) provides
coexpression relationships for multiple animal species, as
comparisons of coexpressed gene lists can enhance the reliability of
gene coexpression determinations. Here, we report updates to the
database, focusing mainly on the following two points. First, we
updated our coexpression data by including recent microarray data
for the previous seven species (human, mouse, rat, chicken, fly,
zebrafish and nematode) and adding four new species (monkey, dog,
budding yeast and fission yeast), along with a new human microarray
platform. A reliability scoring function, based on coexpression
conservation, was also implemented to filter out coexpression with
low reliability. Second, the network drawing function was updated to
implement automatic cluster analyses with enrichment analyses for
Gene Ontology terms and cis elements, along with interactive network
analyses with Cytoscape Web. With these updates, COXPRESdb will
become a more powerful tool for analyses of functional and
regulatory networks of genes in a variety of animal species.



INTRODUCTION 

The construction of a gene network is a fundamental step toward
understanding global cellular processes. In addition, recent
genome-wide association studies, using high-throughput sequencing
technology, have revealed many uncharacterized genotypes associated
with a particular phenotype (1,2). To investigate the molecular
mechanisms underlying the connections between genotype and
phenotype, networks of mRNAs or proteins are useful. Several
databases, such as IntAct (3) and STRING (4), have focused on
protein-protein interaction network construction. For mRNA network
analysis, similarities of gene expression profiles (gene
coexpression) are calculated from a vast amount of microarray data.
Databases for gene coexpression have achieved great success in the
field of plant biology (5-8). In contrast, their use in mammalian
fields is still limited, with some exceptional reports (9,10),
although several coexpression databases, such as Genevestigator
(11), STARNET2 (12), SNPxGE2 (2) and ours, COXPRESdb, have been
developed.

To promote the use of coexpression analyses in animals, we have been
developing a gene coexpression database named COXPRESdb
(coexpression database). We have especially focused on the
reliability of coexpression data, by providing comparisons of
coexpression among different species, along with a network view of
the relationships between coexpressed genes (13,14). Although the
gene network view can provide an overview of the system of interest,
the construction of a large-scale gene network is not easy because
such a network tends to be too complicated to fully comprehend.
Several approaches have been developed to visualize and aid the
understanding of large-scale gene networks, by controlling the
cluster size (15) or combining biological-property-based clustering
(16). Another weak point of coexpressed gene network analysis lies
in the quality of the coexpression data. The quality of the
coexpression data for animals is generally worse than that for
Arabidopsis in an assessment using Gene Ontology (GO) annotation
(17), probably due to the increased complexity of animal systems
(18).

Table 1. Summary of the update of the coexpression data from version 4.1 to 5.0

Species                    Abbreviation  Microarray platform      Number    Number of microarrays
                                         (Affymetrix product ID)  of genes  ver. 5.0      ver. 4.1
Homo sapiens               Hsa           [illegible]              19803     [illegible]   [illegible]
Homo sapiens               Hsa2          [illegible]              19788     [illegible]   -
Mus musculus               Mmu           Mouse430_2               20403     31479 (c3.0)  2226 (c2.1)
Rattus norvegicus          Rno           Rat230_2                 13751     27481 (c3.0)  3526 (c2.0)
Gallus gallus              Gga           Chicken                  13757     1024 (c2.0)   352 (c1.0)
Danio rerio                Dre           Zebrafish                10112     1126 (c2.0)   590 (c1.0)
Drosophila melanogaster    Dme           Drosophila_2             12626     3336 (c2.0)   1102 (c1.0)
Caenorhabditis elegans     Cel           Celegans                 17256     1034 (c2.0)   514 (c1.0)
Macaca mulatta             Mcc           Rhesus                   15779     675 (c1.0)    -
Canis lupus                Cfa           Canine_2                 16211     377 (c1.0)    -
Saccharomyces cerevisiae   Sce           Yeast_2                  4461      2693 (c1.0)   -
Schizosaccharomyces pombe  Spo           Yeast_2                  4881      111 (c1.0)    -

A "c" prefix is added to each coexpression data version (e.g. c4.0) to avoid confusion with the version of COXPRESdb as a whole (e.g. ver. 5.0).

To enhance the performance of gene coexpression analyses, we updated
two aspects of COXPRESdb. First, we increased the number of samples
for each species and the number of species from 7 to 11, and added
an alternative microarray platform for human, as summarized in
Table 1. In addition, a reliability scoring system was implemented,
based on the similarity of coexpression patterns among the species.
Second, the network drawing tool was improved. The new tool
automatically divides a large, complex network into smaller, compact
clusters. Each compact cluster is then characterized by GO and cis
element enrichment analyses. In addition, users can select the
Cytoscape Web system (19) to interactively modify the network
alignment and to work as a bridge to stand-alone Cytoscape (20) for
more complex analyses. Furthermore, all of the coexpression data are
now available in SPARQL for the semantic web communities, using the
Virtuoso Universal Server at http://coxpresdb.jp/sparql, which will
promote the building of mashup applications with various omics data
sets.



QUALITY ASSESSMENT OF COEXPRESSION DATA 

New coexpression data 

The calculation procedure for the coexpression data is the same as
in our previous report (18). Briefly, GeneChip raw data were
obtained from ArrayExpress (21) and normalized by the RMA method
(22) for each compressed file, assuming that each compressed file
corresponds to one experimental set. Then, the weighted Pearson's
correlation coefficient of the expression profiles was calculated
for every pair of genes in each species. Finally, the correlation
coefficient was transformed into the mutual rank (MR) (18). In the
network, a node corresponds to a gene, and edges are drawn from each
gene to its three most strongly coexpressed genes. The evolutionary
relationships were determined by using HomoloGene (23), and edges
found between homologous gene pairs in different species, if any,
were considered as common edges among the species.
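
The steps above can be illustrated with a minimal Python sketch. It assumes the MR definition of the cited report (18), that is, the geometric mean of the rank of gene B in the coexpression list of gene A and the rank of A in the list of B. For brevity it uses a plain (unweighted) Pearson correlation rather than the weighted one used for the database. The function names and the toy profiles are hypothetical and are not part of COXPRESdb.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mutual_rank(profiles):
    """Return {(a, b): MR} for every ordered pair of distinct genes."""
    genes = list(profiles)
    rank = {}
    for a in genes:
        # Rank the other genes by decreasing correlation with gene a.
        partners = sorted(
            (b for b in genes if b != a),
            key=lambda b: pearson(profiles[a], profiles[b]),
            reverse=True,
        )
        for position, b in enumerate(partners, start=1):
            rank[a, b] = position
    # MR is the geometric mean of the two directional ranks.
    return {(a, b): sqrt(rank[a, b] * rank[b, a]) for (a, b) in rank}


def top_edges(mr, genes, n_top=3):
    """Undirected edges from each gene to its n_top partners with smallest MR."""
    edges = set()
    for a in genes:
        partners = sorted((b for b in genes if b != a), key=lambda b: mr[a, b])
        edges.update(frozenset((a, b)) for b in partners[:n_top])
    return edges


# Toy expression profiles (hypothetical values, five samples per gene).
profiles = {
    "g0": [1, 2, 3, 4, 5],
    "g1": [1, 2, 3, 4, 6],
    "g2": [5, 4, 3, 2, 1],
    "g3": [1, 3, 2, 5, 4],
}
mr = mutual_rank(profiles)
edges = top_edges(mr, list(profiles), n_top=1)
```

Because MR combines both directional ranks, a pair obtains a small MR only when each gene ranks the other highly, which is why a smaller MR indicates stronger coexpression.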

To assess the difference between the previous and new versions, we
counted the number of common edges (Nc) for all pairs of the seven
species in each version. These numbers provide a quick measure of
the quality of the coexpression data, because similar coexpression
observed on independent microarray platforms is unlikely to be an
experimental artifact. As a result, all pairs of species, except for
the human-nematode pair, showed an increase in Nc (Figure 1). On
average, Nc increased 1.5-fold, and large increases of Nc were
observed for the human-mouse, mouse-rat and mouse-chicken pairs,
which may correspond to the large increase in the number of mouse
samples. In addition to renewing the data for the previous seven
species, we added four new species (monkey, dog and two yeast
species), as well as human coexpression data from another microarray
platform. The Nc values against the human data are summarized in
Table 2.
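
As an illustration of how Nc can be counted, the sketch below maps the edges of one species onto another through an ortholog table and counts the edges found in both networks. The function name, the one-to-one ortholog dictionary and the example edges are assumptions made for this sketch; the actual HomoloGene-based procedure may treat many-to-many homologs differently.

```python
def common_edges(edges_a, edges_b, orthologs):
    """Return the edges of species A that are also present in species B.

    edges_a, edges_b: iterables of (gene, gene) pairs.
    orthologs: dict mapping a gene in species A to its ortholog in species B.
    """
    edges_b = {frozenset(edge) for edge in edges_b}
    common = set()
    for u, v in edges_a:
        if u in orthologs and v in orthologs:
            mapped = frozenset((orthologs[u], orthologs[v]))
            # Skip pairs that collapse onto a single gene in species B.
            if len(mapped) == 2 and mapped in edges_b:
                common.add(mapped)
    return common


# Hypothetical example edges; they are not taken from the database.
human_edges = [("CHEK1", "HELLS"), ("CHEK1", "DTL"), ("CHEK1", "GENE_X")]
mouse_edges = [("Hells", "Chek1"), ("Chek1", "Melk")]
orthologs = {"CHEK1": "Chek1", "HELLS": "Hells", "DTL": "Dtl"}

nc = len(common_edges(human_edges, mouse_edges, orthologs))
```

Edges are stored as unordered pairs so that (A, B) and (B, A) count as the same edge.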



SIMILARITY OF COEXPRESSION PATTERNS AMONG SPECIES

The coexpressed gene list in COXPRESdb provides a comparative view
across orthologous genes in other species (14). This comparative
view shows the evolutionary conservation of the coexpression pattern
of the guide gene, which can be a measure of the reliability of the
coexpression data (24,25). Figure 2 shows the coexpressed gene list
for the human CHEK1 gene. The alternative human platform (Hsa2) and
mouse (Mmu) show coexpression degrees similar to those of the human
(Hsa) coexpression, reflecting the high quality of the coexpression
data for these species, which are based on a large amount of
microarray data. The conservation degrees with monkey (Mcc), rat
(Rno), dog (Cfa) and zebrafish (Dre) are also good. The low
coexpression conservation with fly (Dme), nematode (Cel) and the two
yeast species (Sce, Spo) seems to be derived from the greater
evolutionary distance from human and/or the relatively poor
coexpression data based on the small amount of microarray data
(Table 1). In particular, the chicken (Gga) coexpression data are
different from the
human data. This may be due to a defective probe for this gene
because, when we checked the coexpressed gene list for this gene in
chicken, almost no orthologous genes showed coexpression
conservation.

Figure 1. Distribution of the number of common coexpression edges (Nc) between species. Large increases in common coexpression edges are
observed in the (a) human-mouse, (b) mouse-rat and (c) mouse-chicken pairs, suggesting significant improvement of the mouse coexpression data.
The number of common edges increased 1.5-fold on average.

Table 2. Number of edges in a human platform that are also observed in other species, by version

Species                    ver. 5.0  ver. 4.1
Mus musculus               1397      757
Canis lupus                896       -
Rattus norvegicus          803       720
Macaca mulatta             545       -
Gallus gallus              358       211
Danio rerio                172       156
Drosophila melanogaster    84        49
Caenorhabditis elegans     38        39
Saccharomyces cerevisiae   35        -
Schizosaccharomyces pombe  13        -

The total numbers of edges in human are 59 409 (ver. 5.0) and 59 331 (ver. 4.1).

As seen in this example, the conservation of coexpression can
support the reliability of the guide gene (14), but users would have
to check all of the coexpressed genes in each species to determine
the reliability of each orthologous gene. To solve this problem, we
introduced a similarity measure, COXSIM, which is the weighted
concordance rate of the coexpressed gene lists.

\[
\mathrm{COXSIM}(k, g, \mathrm{sp1}, \mathrm{sp2}) =
  \frac{\sum_{i=1}^{k} n(i, g, \mathrm{sp1}, \mathrm{sp2})}{\sum_{i=1}^{k} i}
\]

where n(i, g, sp1, sp2) is the number of common genes (orthologous
genes in the case of a comparison between different species) found
in the top i coexpressed gene lists from a guide gene g in species
sp1 and that in species sp2. We set k to 100, meaning that we check
the gene correspondence of the top 100 coexpressed genes, which is a
reasonable limit when designing biological experiments (7).

Here, defective probes will show noisy expression 
patterns, which cause unreliable coexpression that does 
not show any correspondence with other coexpression 
data. In other words, the maximal value of COXSIM 
(coexpression similarity) between the coexpressed gene 
list from an unreliable gene and that from its orthologous 
genes should be low. Based on this idea, maxCOXSIM 
is introduced as the reliability score of a guide gene. 

\[
\mathrm{maxCOXSIM}(g, \mathrm{sp1}) = \max_{\mathrm{sp2}} \mathrm{COXSIM}(k, g, \mathrm{sp1}, \mathrm{sp2})
\]
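
The two scores can be sketched in a few lines of Python. This sketch assumes that COXSIM is the sum of the overlap counts n(i) over the top-i lists, normalized by the sum of i from 1 to k so that identical lists score 1. It also assumes that the list from the second species has already been translated into the gene identifiers of the first species through the ortholog table. The function names and the example lists are hypothetical.

```python
def coxsim(list1, list2, k=100):
    """Weighted concordance of two ranked coexpressed gene lists.

    Genes shared near the top of both lists are counted in more of the
    top-i overlaps, so they contribute more to the score.
    """
    overlap = sum(len(set(list1[:i]) & set(list2[:i])) for i in range(1, k + 1))
    return overlap / (k * (k + 1) // 2)


def max_coxsim(guide_list, lists_by_species, k=100):
    """Return (species, score) for the best-matching comparison species."""
    scores = {sp: coxsim(guide_list, genes, k) for sp, genes in lists_by_species.items()}
    if not scores:
        return None, 0.0  # no ortholog available in any other species
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical ranked lists, already mapped to a common gene naming.
guide = ["a", "b", "c"]
others = {"Mmu": ["a", "b", "c"], "Gga": ["x", "y", "z"], "Rno": ["b", "a", "d"]}
best_species, best_score = max_coxsim(guide, others, k=3)
```

With k = 3, the "Rno" list above shares no gene in the top 1 and two genes in both the top 2 and the top 3, giving (0 + 2 + 2) / (1 + 2 + 3).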

The significance of the maxCOXSIM value is assessed from the null
distribution for 10 species comparisons, each containing 10000
genes. Note that this assumption gives a rather severe evaluation,
and thus the significance is underestimated (that is, the P-value is
conservative) for most guide genes, because both a larger number of
species in the comparison and a smaller number of genes in a genome
cause higher maxCOXSIM values by chance. We show the significance
level by stars on the gene list in COXPRESdb, where single, double
and triple stars correspond to P-values <1E-4, <1E-12 and <1E-20,
respectively. Genes with lower reliability can be filtered out by
the Row and Column filters (Figure 2). The numbers and ratios of
genes at each significance level are shown in Figure 3.
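
The star assignment can be written as a small Python sketch. The thresholds are those stated above; the function names and the example P-values are hypothetical, and the filter is only an analogy to the Row filter on the web page, not its actual implementation.

```python
# P-value thresholds for one, two and three stars, as stated in the text.
STAR_THRESHOLDS = (1e-4, 1e-12, 1e-20)


def reliability_stars(p_value):
    """Number of stars (0 to 3) for a maxCOXSIM P-value."""
    return sum(p_value < threshold for threshold in STAR_THRESHOLDS)


def filter_by_reliability(p_values, min_stars=1):
    """Keep the genes whose reliability reaches at least min_stars."""
    return [gene for gene, p in p_values.items() if reliability_stars(p) >= min_stars]


# Hypothetical P-values for three genes.
example = {"geneA": 3e-25, "geneB": 2e-6, "geneC": 0.3}
reliable = filter_by_reliability(example, min_stars=1)
```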

[Figure 2: screenshot of the COXPRESdb gene list "Top 200 coexpressed genes to CHEK1 (Hsa c4.0 coexpression data)", showing the Row and Column filters and, for each coexpressed gene (HELLS, DTL, MELK, CDC6, TIPIN, POLE2, ...), its reliability stars and its MR values in Hsa, Hsa2, Mcc, Mmu, Rno, Cfa, Gga, Dre, Dme, Cel, Sce and Spo.]

Figure 2. An example of a coexpressed gene list in COXPRESdb. The human CHEK1 gene is used as an example of a guide gene, and the
coexpressed genes are shown along with their MR values (a smaller MR value indicates stronger coexpression). The 11 columns on the right
indicate the coexpression degrees of the ortholog pairs in other species (or another human platform). Coexpression with MR >200 is considered
weak and is shown in a faded color. A blank cell means that coexpression data are not available for the gene in the corresponding species
(or platform). The reliability is calculated based on the coexpression conservation, and is represented with stars (triple star: highly reliable; no star:
no conservation support). This list is available at http://coxpresdb.jp/cgi-bin/coex_list.cgi?gene=1111&sp=Hsa.




Figure 3. Number of genes for each reliability level. Reliability levels are represented as stars, where no star is the lowest and a triple star is the
highest reliability. Numbers in the bars indicate the percentage of each reliability level in each species, where the numbers with no star include genes
without any orthologous genes in other species.

Figure 4. Two network analysis flows in NetworkDrawer. For a set of user-defined genes, NetworkDrawer draws the gene network. Larger nodes are
the query genes and smaller gray nodes are additional nodes with one or more edges to at least one query node. Solid lines and red dotted lines indicate
gene coexpression and protein-protein interactions from the HPRD (26) and IntAct (3) databases, respectively. The orange solid lines indicate conserved
coexpression observed in at least one species in COXPRESdb. The new NetworkDrawer can be used for two network analysis flows. The first flow is
composed of automatic cluster detection (A) and enrichment analyses of cis elements and GO annotations (B) with detailed cis element information
(C). The second flow uses the Cytoscape Web system (D), which enables the user to interactively modify the network alignment. The user can output
this network as an image, save it and then load it on this web system, or continue the analysis and visualization on stand-alone Cytoscape.



ENHANCEMENT OF THE NETWORK ANALYSIS TOOL

The coexpressed gene network is especially useful for analyzing the
large number of genes generated by transcriptome or proteome
analyses, because the network representation can draw all of the
pairwise gene relationships for the query genes at one time.
NetworkDrawer in COXPRESdb is a tool that draws the gene network for
the query genes specified by users, by searching for coexpression
along with protein-protein interactions among the genes or gene
products (Figure 4). In this example, three groups of genes can be
identified by visual inspection. To characterize these groups, two
new network analysis flows are provided in the new NetworkDrawer, in
addition to the marks for KEGG annotation (27) available in the
previous version of COXPRESdb.

The first analysis flow is composed of automatic cluster detection
and characterization (Figure 4A-C). The cluster detection step has
two parameters, a clique detection parameter and a clique merge
parameter, which are both set to 0.5 by default but can be changed
through the text box on the web page; a smaller clique parameter and
a larger merge parameter produce larger sub-graphs. The clustering
algorithm has been newly developed for both a rapid response and the
detection of clique-like sub-graphs, by iteratively merging the node
with a higher PageRank value (28). The details of the clustering
algorithm will be described elsewhere. After the clustering, users
can easily select a cluster by using the radio button in the cluster
summary table, to mark the nodes in the selected cluster with
balloon icons (the orange balloons in Figure 4A). The results of the
enrichment analyses for each cluster are available from the links in
the table (Figure 4B). In addition to the GO enrichment analysis, we
have also provided a cis element motif enrichment analysis. Gene
coexpression is mainly driven by cis elements in the promoter
regions, especially the proximal promoter region (29). In
Arabidopsis, large-scale cis element discovery was performed, based
on gene coexpression (30). Therefore, we performed enrichment
analyses by a hypergeometric test for heptamer motifs in the
proximal promoter regions (-200 to +100) around transcription start
sites downloaded from DBTSS (31). The enriched heptamers are
compared with the reported cis elements in JASPAR (32) (Figure 4C).
To further characterize the heptamers, the enriched GO annotations
of the genes having the heptamer motif are calculated (Figure 4C).
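
The hypergeometric test for a single heptamer can be sketched as follows. This is a minimal illustration of a one-sided enrichment test, not the code used by NetworkDrawer: the function names, the toy promoter sequences and the example heptamer are hypothetical, only the given strand is searched, and no multiple-testing correction is applied.

```python
from math import comb


def genes_with_motif(promoters, motif):
    """Genes whose promoter sequence contains the motif at least once."""
    return {gene for gene, seq in promoters.items() if motif in seq}


def enrichment_pvalue(promoters, cluster, motif):
    """P(X >= observed) that a cluster of this size contains at least the
    observed number of motif-bearing genes, under the hypergeometric model.

    promoters: dict gene -> promoter sequence for all background genes.
    cluster: genes of one cluster (assumed to be a subset of the background).
    """
    background = genes_with_motif(promoters, motif)
    n_total, n_motif = len(promoters), len(background)
    n_cluster = len(set(cluster))
    n_hit = len(background & set(cluster))
    tail = sum(
        comb(n_motif, i) * comb(n_total - n_motif, n_cluster - i)
        for i in range(n_hit, min(n_motif, n_cluster) + 1)
    )
    return tail / comb(n_total, n_cluster)


# Toy background: 10 genes, 3 of which carry the example heptamer.
promoters = {f"g{i}": "ACGT" * 5 for i in range(10)}
for gene in ("g0", "g1", "g2"):
    promoters[gene] += "TTTCGCG"

p_one_hit = enrichment_pvalue(promoters, ["g0", "g5"], "TTTCGCG")
p_two_hits = enrichment_pvalue(promoters, ["g0", "g1"], "TTTCGCG")
```

In the toy data, drawing 2 genes from 10 with 3 motif-bearing genes gives P = 24/45 for at least one hit and 3/45 for two hits.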

The second flow of the gene network analysis is the use of the
Cytoscape Web system (19) (Figure 4D). This system enables users to
interactively modify the network alignment, export the network as an
image (SVG, PNG or PDF format) and save it in the XGMML format. The
XGMML file can be uploaded to the same Cytoscape Web system and can
also be used in stand-alone Cytoscape (20) for advanced analyses.
This system is also available for gene networks in the locus page
and the GO network page in COXPRESdb.
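
For readers who wish to prepare a network file themselves, the following Python sketch writes a small edge list as a minimal XGMML document. It is an assumption-laden illustration: the function name and the example edges are hypothetical, only the graph, node and edge elements with basic attributes are emitted, and we have not checked here which additional attributes Cytoscape or Cytoscape Web may expect. The files exported by COXPRESdb may differ.

```python
import xml.etree.ElementTree as ET

XGMML_NS = "http://www.cs.rpi.edu/XGMML"  # XGMML namespace


def to_xgmml(edges, label="coexpression network"):
    """Serialize undirected (gene, gene) edges as a minimal XGMML string."""
    ET.register_namespace("", XGMML_NS)
    graph = ET.Element(f"{{{XGMML_NS}}}graph", {"label": label, "directed": "0"})
    node_ids = {}
    # One node element per gene, with a numeric id and the gene name as label.
    for gene in sorted({gene for edge in edges for gene in edge}):
        node_ids[gene] = str(len(node_ids) + 1)
        ET.SubElement(graph, f"{{{XGMML_NS}}}node",
                      {"id": node_ids[gene], "label": gene})
    # One edge element per pair, referring to the node ids.
    for a, b in edges:
        ET.SubElement(graph, f"{{{XGMML_NS}}}edge",
                      {"source": node_ids[a], "target": node_ids[b],
                       "label": f"{a} - {b}"})
    return ET.tostring(graph, encoding="unicode")


# Hypothetical example edges.
xgmml_text = to_xgmml([("CHEK1", "HELLS"), ("CHEK1", "DTL")])
```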